This report provides an evaluation of the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include forecasts submitted starting in April 2020. Others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.
In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every Tuesday morning, we combine the most recent forecast from each team into a single “ensemble” forecast for each submission target. This is used as the official ensemble forecast of the CDC, typically appearing on their forecasting website on Wednesday.
The first table evaluates models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for all historical weeks. To account for the variation in difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
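As a concrete illustration, WIS can be computed for a single forecast from its predictive median and a set of central prediction intervals. The sketch below follows the standard WIS definition; the function and variable names are our own for illustration and are not taken from the Hub's evaluation code.

```python
def interval_score(lower, upper, alpha, y):
    """Interval score for a central (1 - alpha) prediction interval and observation y."""
    score = upper - lower  # width penalty
    if y < lower:
        score += (2 / alpha) * (lower - y)  # penalty for observation below the interval
    elif y > upper:
        score += (2 / alpha) * (y - upper)  # penalty for observation above the interval
    return score


def weighted_interval_score(median, intervals, y):
    """WIS for one forecast; `intervals` is a list of (alpha, lower, upper) tuples."""
    total = 0.5 * abs(y - median)  # absolute-error term for the median, weight 1/2
    for alpha, lower, upper in intervals:
        total += (alpha / 2) * interval_score(lower, upper, alpha, y)
    return total / (len(intervals) + 0.5)
```

With many intervals, WIS approximates the continuous ranked probability score, which is why it serves as a measure of whole-distribution accuracy rather than point accuracy alone.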
The second table evaluates models based on their prediction interval coverage at the 50% and 95% levels. Scores are aggregated separately for the most recent 10 weeks and for all historical weeks.
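Prediction interval coverage is simply the fraction of observations that fall inside the corresponding intervals. A minimal sketch (names are illustrative):

```python
import numpy as np


def empirical_coverage(lower, upper, observed):
    """Fraction of observations falling within their [lower, upper] prediction intervals."""
    lower, upper, observed = map(np.asarray, (lower, upper, observed))
    return float(np.mean((observed >= lower) & (observed <= upper)))
```

A well-calibrated 95% prediction interval should cover roughly 95% of observations; values well below that indicate overconfident forecasts.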
Inclusion criteria for each column are detailed below the table.
To calculate each column in our table, different inclusion criteria were applied. The table only includes models that have submitted at least 50% of possible forecasts over the last 10 weeks or at least 50% of possible forecasts since the first week in April.
The column titled “n recent forecasts” lists the number of forecasts a team has submitted with a target end date within the most recent 10-week period.
Columns 3 and 4 show the adjusted relative WIS and the adjusted relative MAE over the most recent 10-week period. For inclusion in these columns, a model must have submitted forecasts for at least 50% of the evaluated forecasts in the most recent evaluation period.
Column 5 shows the number of forecasts a team has submitted over the historical period.
Columns 6 and 7 show the adjusted relative WIS and adjusted relative MAE over a historical period beginning the first week in March. For inclusion in these columns, a model must have submitted predictions for 50% or more of the evaluated forecasts in the historical evaluation period.
For inclusion in this table, a model must have contributed forecasts for 5 or more weeks in total since the beginning of April, or have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.
The data in this graph have been aggregated over all locations and submission weeks. The models included have submitted at least 50% of possible forecasts over the last 10 weeks.
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all 50 states at a 1-week horizon for submission weeks beginning the first week in April. The second figure shows the mean WIS aggregated across the same locations, but at a 4-week horizon.
To view a specific team, double-click the team name in the legend. To view a value on the plot, click on the point of interest. To view a specific time period, highlight that section of the graph or use the zoom functionality.
In this figure, the dotted black line represents the average 1-week-ahead error. Error is often larger at the 4-week horizon than at the 1-week horizon.
We would expect a well-calibrated model to have a value of 95% in this plot.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error at the 4-week horizon than at the 1-week horizon.
The figures below show model performance stratified by location. In these figures, we only include models that have submitted forecasts for all 4 horizons and for at least 50% of the past 10 evaluated weeks.
The color scheme shows the WIS relative to the baseline model. The only locations evaluated are the 50 states and a national-level forecast.
This figure shows the number of incident COVID-19 cases reported each week in the US. The period between the vertical blue lines shows the weeks included in the “recent” model evaluations.
The first table below evaluates models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for all historical weeks. To account for the variation in difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
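The pairwise approach accounts for models submitting forecasts for different subsets of weeks and locations: each pair of models is compared only on the forecasts both submitted, and a model's overall skill is the geometric mean of its pairwise mean-score ratios, rescaled so the baseline equals 1. The sketch below illustrates the idea; the data layout and names are our own assumptions, not the Hub's actual evaluation code.

```python
import numpy as np


def pairwise_relative_wis(scores, baseline):
    """
    scores: {model: {forecast_id: WIS}}, where forecast_id identifies one
    (location, week, horizon) combination. Returns each model's relative WIS:
    geometric-mean pairwise skill, rescaled so the baseline model equals 1.
    """
    models = list(scores)
    theta = {}
    for i in models:
        log_ratios = []
        for j in models:
            common = scores[i].keys() & scores[j].keys()  # shared forecasts only
            if not common:
                continue
            mean_i = np.mean([scores[i][f] for f in common])
            mean_j = np.mean([scores[j][f] for f in common])
            log_ratios.append(np.log(mean_i / mean_j))
        theta[i] = np.exp(np.mean(log_ratios))  # geometric mean of pairwise ratios
    return {m: theta[m] / theta[baseline] for m in models}
```

Because every ratio is computed only on forecasts common to both models, a model is not penalized (or rewarded) merely for skipping easy or hard weeks.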
The second table evaluates models based on their prediction interval coverage at the 50% and 95% levels. Scores are aggregated separately for the most recent 10 weeks and for all historical weeks.
Inclusion criteria for each column are detailed below the table.
In this table, we have included all models with an eligible WIS or MAE score.
In order to be eligible for adjusted relative WIS or MAE over the most recent 10-week period, a model must have submitted forecasts for 50% or more of the evaluated forecasts in the most recent evaluation period. WIS was only calculated for teams that submitted all required quantiles.
In order to be eligible for the historical calculation of MAE or WIS, a model must have submitted predictions for 50% or more of the evaluated forecasts in the historical evaluation period.
For inclusion in this table, a model must have contributed forecasts for 5 or more weeks in total since the beginning of April, or have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.
The data in this graph have been aggregated over all locations and submission weeks. The models included have submitted at least 50% of possible forecasts over the last 10 weeks.
In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at the national level for each time point.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all locations at a 1-week horizon for each submission week. The second figure shows the mean WIS aggregated across locations at a 4-week horizon.
To view a specific team, double-click the team name in the legend. To view a value on the plot, click on the point of interest. To view a specific time period, highlight that section of the graph or use the zoom functionality.
In this figure, the dotted black line represents the average 1-week-ahead error. There is larger variation in error at the 4-week horizon than at the 1-week horizon.
The black line represents the nominal 95% coverage level.
The black line represents the nominal 95% coverage level.
The figures below show model performance stratified by location. In these figures, we only include models that have submitted forecasts for all 4 horizons and for at least 50% of the last 10 evaluated weeks.
The color scheme shows the WIS relative to the baseline model. The only locations evaluated are the 50 states and a national-level forecast.
This plot shows the observed number of incident deaths over time in the US. The period between the vertical blue lines shows the weeks included in the “recent” model evaluations.